DOCUMENT RESUME
129 9221 TH 005 801 

AUTHOR Kosecoff, Jacqueline; Fink, Arlene

TITLE The Feasibility of Using Criterion-Referenced Tests
for Large-Scale Evaluations.

PUB DATE Apr 76

NOTE 58p.; Paper presented at the Annual Meeting of the
American Educational Research Association (60th, San
Francisco, California, April 19-23, 1976)

EDRS PRICE MF-$0.83 HC-$3.50 Plus Postage.

DESCRIPTORS Criteria; *Criterion Referenced Tests; Definitions;
*Feasibility Studies; *Program Effectiveness;
*Program Evaluation; Scores; Test Construction; Test
Interpretation; Test Reliability; Test Reviews; Test
Selection; Test Validity



ABSTRACT 

The feasibility of using criterion referenced tests
(CRTs) in a large-scale evaluation conducted in an effectiveness
evaluation context was investigated. The study began by examining the
theory that structures the development and validation of CRTs to
discover whether, on theoretical grounds alone, CRTs are suitable or
not suitable for large-scale effectiveness evaluations. Next, a set
of criteria were developed for selecting tests appropriate for such
evaluations. Included within the set of criteria was the stipulation
that the test be able to provide scores amenable to CRT
interpretation. Twenty-eight currently available CRTs were then
reviewed, using the set of criteria. Finally, based on theoretical
examination and the review, conclusions were drawn. Based on
practical, not theoretical, considerations, it was concluded that
there is no currently available CRT that is feasible for use in
large-scale effectiveness evaluations. (RC)







The Feasibility of Using
Criterion-Referenced Tests for
Large-Scale Evaluations



Jacqueline Kosecoff, Ph.D. and Arlene Fink, Ph.D.*
Center for the Study of Evaluation
University of California at Los Angeles



*The authors wish to express their
gratitude to Penelope Morgan who contributed important
ideas to this investigation and assisted in reviewing tests.




A paper presented at the Annual Meeting
of the American Educational Research Association,
San Francisco, April 1976.



Criterion-referenced tests are becoming increasingly popular among
educators and psychometricians. Perhaps the most important reason for their
appearance and widespread acceptance can be traced to the new ways that had
to be found to measure the effects of the educational reforms of the 1950's
and 1960's. During those decades, the conventional school curriculum was
declared in need of reform, and a reassessment of the goals and objectives
of American education was made (Hofstadter, 1963; Davis and Diamond, 1974;
Cronbach and Suppes, 1969). Innovative courses of study and instructional
technologies were subsequently developed, and programmed learning and
individualized instruction became commonly used teaching approaches. New ways of
assessing student performance were needed that corresponded to the innovations.

Educators have traditionally relied on paper and pencil achievement
tests to measure learning, so it was natural for them to turn to test
theoreticians to provide them with alternative ways of interpreting performance on
measures of educational achievement for the new curriculum and instruction.
The psychometricians responded by pointing to two basic ways of assigning
meaning to test scores. The first involved comparing one person's or group's
performance or behavior with another person's or group's, and the second
involved describing what a person or group can do or can be expected to do.
Glaser (1963) referred to these two ways of giving meaning to test scores as
"norm-referenced" and "criterion-referenced," and recommended
criterion-referenced score interpretations for the reformed curriculum and instruction.
According to Glaser and his colleagues, "A criterion-referenced test is one
that is deliberately constructed to give scores that tell what kinds of
behavior individuals with those scores can demonstrate" (Glaser and Nitko, 1971).



The reaction to criterion-referenced tests (CRTs) was enthusiastic
from the start. Because they provide score interpretations in terms of
the achievement of specific and measurable skills and behaviors, CRTs have
had appeal to those directly responsible for the education of students and
the development and evaluation of educational programs. They also have had
appeal to teachers who found the results of standardized tests inadequate
to assist them in planning lessons, and to many educators and psychologists
who judged standardized, norm-referenced tests to be unfair and even biased
against individuals from under-privileged and minority groups. Finally,
because the criterion-referenced approach was new, people saw it as an
opportunity to improve on some of the mistakes they perceived to be built into
norm-referenced testing.

CRTs' popularity and sanction by theoreticians and practitioners have
led to their frequent use for instructional diagnosis and placement and for
measuring student achievement on educational tasks or objectives. In addition,
CRTs are being suggested or used for other purposes like the evaluation
of educational programs and the National Assessment of Educational
Progress (Wilson, 1974). In fact, many recently issued requests for proposals
from state and federal agencies to evaluate educational programs have
specifically required prospective contractors to justify their selection of
standardized rather than CRT measures.

The purpose of this paper is to investigate the feasibility of using
criterion-referenced tests in a large-scale evaluation conducted in an
effectiveness evaluation context.



The investigation began by examining the theory that structures the
development and validation of CRTs to discover whether, on theoretical
grounds alone, CRTs are suitable or not suitable for large-scale
effectiveness evaluations. The next step was to develop a set of criteria for
selecting tests appropriate for such evaluations. Included within the set
of criteria was the stipulation that the test be able to provide scores
amenable to CRT interpretation. Currently available CRTs were then
reviewed, using the set of criteria. Finally, based on the theoretical
examination and the review, conclusions were drawn. This paper describes
the investigation, and is organized into four parts:

• The Effectiveness Evaluation Context

• A Theoretical Examination of Criterion-Referenced Testing

• Review of Currently Available CRTs

• Conclusions

The Effectiveness Evaluation Context

Evaluation is a set of procedures used to appraise an educational
program's merit and to provide information about the nature and quality of
the program's goals, outcomes, impact, and costs (Fink and Kosecoff, 1976).

Evaluation Contexts 

There are two contexts in which evaluations of educational programs are
conducted. In one context, an evaluation is conducted to improve a program,
and the evaluation's clients are typically the program's organizers and
staff. In the second context, an evaluation is conducted to measure the
effectiveness of a program, and the evaluation's clients are typically
the program's sponsors. The context for an evaluation is determined by
the information needs of the individuals and agencies who must use the
evaluation information.

An evaluation is performed in an improvement context when the
evaluation's clients are concerned with finding out precisely where a change
would make the program better. Typically, the organizers of a
still-developing program require this kind of information so that they can modify
and improve the program. On the other hand, an evaluation is conducted in
an effectiveness context when the evaluation's clients are particularly
concerned with determining the consistency and efficiency with which the
program achieves desired results. Those individuals who sponsored program
development, or who are interested in using the program, require this kind
of information about a well-established program's outcomes and impact. In
addition, in an effectiveness context, the evaluator usually makes use of
powerful experimental design strategies that permit comparisons, rely on
empirically-validated and standardized instruments, and employ statistical
and other analytic methods that allow inferences regarding the program's
comparative value. Finally, in an effectiveness evaluation, the evaluator
usually assumes a more global and independent stance toward the program
than in an improvement context.



It is generally agreed (e.g., Alkin et al., 1974) that information
collection strategies for large-scale evaluations should rely upon
instruments that have been demonstrated to be valid and reliable for the target
population, and that are known to provide relevant information.



A Theoretical Examination of Criterion-Referenced Testing

In this section, theoretical issues in the development and validation
of CRTs will be discussed. These include a definition of CRTs, the
formulation and generation of CRT objectives and items, score interpretation
schemes, establishing item and test quality, and the use of classical indexes
of reliability and validity. Based on this discussion, the theoretical
appropriateness of CRTs in effectiveness evaluation contexts will be investigated.

Definition

A criterion-referenced test is one that is designed to provide a
measure of the extent to which educational purposes or tasks have been
achieved. All CRTs share several features in common:

1. They are based on clearly-defined educational tasks
and purposes.

2. Test items are specifically designed to measure the purposes
and tasks.

3. Scores are interpreted in terms of attainment of a pre-set
criterion or level of competence with respect to the
purposes and tasks.

Other definitions of CRTs have also been offered. Three of the most
often-used definitions are:

1. "A criterion-referenced test is one that is deliberately
constructed to yield measurements that are directly interpretable
in terms of specified performance standards. . . Performance
standards are generally specified by defining a class or domain
of tasks that should be performed by the individual" (Glaser
and Nitko, 1971).

2. "A pure criterion-referenced test is one consisting of a sample
of production tasks drawn from a well-defined population of
performances, a sample that may be used to estimate the proportion
of performances in that population at which the student can
succeed" (Harris and Stewart, 1971).

3. "Criterion-referenced measures are those which are used to
ascertain an individual's status with respect to some criterion,
i.e., a performance standard" (Popham and Husek, 1969).

While these definitions differ considerably in terms of the limitations
and constraints placed on a criterion-referenced test, they all involve
reporting test scores in terms of achievement of educational tasks.



A question frequently asked about criterion-referenced tests concerns
their relationship to norm-referenced tests. To answer this question
briefly, the crucial difference between these tests is the metric used to
describe their scores. Norm-referenced tests report scores that are
intended to permit comparisons or rankings and use metrics like percentiles
and stanines. Criterion-referenced tests report scores in terms of levels
of competence or achievement with respect to a performance criterion and
use metrics like mastery or percent of an objective achieved. All other
differences between norm-referenced and criterion-referenced tests, like
the way each is developed and validated, are derived from the need to
produce tests that permit the appropriate score interpretation.
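The difference in metric can be illustrated with a minimal sketch, under stated assumptions. The norm group, the student's raw score, and both function names below are hypothetical and are not taken from any test reviewed in this paper; the percentile-rank convention (ties counted as half) is one common choice among several.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_group):
    """Norm-referenced metric: percent of the norm group scoring below
    `score`, counting tied scores as half (an assumed convention)."""
    ordered = sorted(norm_group)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

def percent_of_objective(score, n_items):
    """Criterion-referenced metric: percent of the objective's items passed."""
    return 100.0 * score / n_items

# Hypothetical raw scores on a 10-item objective.
norm_group = [3, 4, 5, 5, 6, 6, 7, 7, 8, 9]
student_score = 6

print(percentile_rank(student_score, norm_group))  # standing relative to others
print(percent_of_objective(student_score, 10))     # standing relative to the objective
```

The same raw score yields two different statements: one about the student's rank in a group, the other about how much of the objective was achieved.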


Development of Criterion-Referenced Tests 

Formulating and generating objectives. One of the basic features of
CRTs is their foundation on a clearly-defined set of educational tasks and
purposes. CRT objectives can be selected in at least six ways:

1. Expert judgment. Experts assess, on the basis of their knowledge
and experience in the field, which educational tasks and purposes
are the most important to measure.

2. Consensus judgment. Various groups such as community
representatives, curriculum experts, teachers, and/or school administrators
decide which educational tasks and purposes they consider to be
the most important to measure (Klein, 1972; Wilson, 197?).



3. Curriculum analysis. A team of curriculum experts analyzes a set
of curriculum materials in order to identify, and, where necessary,
infer the educational tasks and purposes that are the focus of the
test (Baker, R.L., 1972).

4. Expert analysis of the subject area to be tested. An in-depth
analysis is made of an area, such as mathematics, in order to
identify all knowledge and skills that must be acquired if the
area is to be learned (Glaser and Nitko, 1971; Nitko, 1973).

5. Theories of learning and instruction. A literature review is
conducted and/or consultants called in to formulate series or
hierarchies of educational tasks and purposes based upon the
results of psychological theory and research (Keesling, J.W.,
1975).

6. Empirical studies. Experiments are conducted in order to
identify the objectives that are most important, because
the skills and knowledge are inherently essential.

No matter how they are derived, educational tasks and purposes are
usually called objectives or behavioral objectives. However, it should be
noted that these terms have a precise meaning to educators: "An objective
is an intent (author's italics) communicated by a statement describing a
proposed change in a learner - a statement of what the learner is to be
like when he has successfully completed a learning experience" (Mager, 1962).
Developers of CRTs do not always use this definition in its purest sense
(Hoepfner, 1975). To them, an objective refers to the content that is
supposed to have been learned (e.g., equivalent and nonequivalent sets
in sixth-grade math) and sometimes includes the behaviors the student is
supposed to exhibit (e.g., naming the first five Presidents of the U.S.).

Other issues concerning educational tasks and purposes, that is,
objectives, relate to the rules needed for writing objectives and how broadly
or narrowly they should be stated. Formal rules for generating and stating
objectives are needed to ensure the uniformity, manageability, and
comprehensiveness of the set of objectives or domain that the CRT measures.*

Still another issue deals with how a domain is organized. The
objectives for a single domain can be grouped by grade levels; they can be
organized according to major content areas; and/or they can be arranged
into a hierarchy according to the complexity of the behaviors involved or the
order of instruction.

*The set of objectives that a CRT measures is sometimes called a domain
or universe of content (Skager, 1975; Cronbach, 1971). However, the term
"domain" is used by others to mean the rules for generating test items to
measure a specific objective (Hively et al., 1973).

Formulating and generating items. Once the objectives for the CRT have
been chosen, the next step is to construct and/or select test items to
measure the objectives. This is one of the most difficult steps in the total
developmental process because of the vast number of test items that might
be constructed for any given objective, even those that are relatively
narrowly defined (Klein and Kosecoff, 1973). For example, consider the
following objective: "The student can compute the correct product of
two single-digit numerals greater than zero where the maximum value of
this product does not exceed thirty." The specificity of this objective
is quite deceptive since there are fifty-five pairs of numerals that
meet this requirement, and at least ten different item types that might
be used to assess student performance, as can be seen in Figure 1.

Figure 1

Types of CRT test items using the numerals 3 and 5

The student can compute the correct product of two single-digit
numerals greater than zero where the maximum value of this product
does not exceed thirty.

a.    5
     x3

b. 5 x 3 =

c. (5)(3) =

d. 5 · 3 =

e. 5 times 3 =

f. The product of 5 and 3 =

g. 5 x __ = 15

h. If x=5 and y=3, what is the value of xy?

i. What numeral multiplied by 3 will equal 15?

j. John has 5 apples. Sally has 3 times as many apples
as John. How many apples does Sally have?



Further, each of the resulting 550 combinations of pairs and item
types could be modified in a variety of ways that might influence whether
they were answered correctly. Some of these modifications are:

. vary the sequence of numerals (e.g., 5 then 3 versus 3 then 5)

. use different item formats (e.g., multiple choice versus completion)

. change the mode of presentation (e.g., written versus oral)

. change the mode of response (e.g., written versus oral)

It soon becomes evident that a highly-specific objective could have a
potential item pool of well over several thousand items (Hively, 1970,
et al., 1973; Bormuth, 1970).
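The counts cited for the multiplication objective can be checked by enumeration. The sketch below is illustrative only: the five templates are hypothetical stand-ins for some of the item types in Figure 1, and the pairs are treated as ordered (5 then 3 is counted separately from 3 then 5), which is the reading under which the enumeration yields fifty-five pairs.

```python
from itertools import product

# Ordered pairs of single-digit numerals greater than zero whose
# product does not exceed thirty.
pairs = [(a, b) for a, b in product(range(1, 10), repeat=2) if a * b <= 30]

# Hypothetical templates standing in for five of the ten item types
# shown in Figure 1; they are not an item form from the literature.
templates = [
    "{a} x {b} = ___",
    "({a})({b}) = ___",
    "{a} times {b} = ___",
    "The product of {a} and {b} = ___",
    "If x={a} and y={b}, what is the value of xy?",
]

# One generated item per (pair, template) combination.
items = [t.format(a=a, b=b) for (a, b) in pairs for t in templates]

print(len(pairs))       # number of qualifying ordered pairs
print(len(pairs) * 10)  # combinations with all ten item types of Figure 1
print(len(items))       # items generated from the five templates above
```

Each further modification (format, mode of presentation, mode of response) multiplies the pool again, which is how a narrowly stated objective reaches thousands of potential items.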

The number of items to construct for each objective is influenced by
several factors. Some of these factors are the amount of testing time
available and the cost of making an interpretation error, such as saying
that a student has achieved mastery when he or she has not. For some
objectives, many items are needed in order to obtain a stable estimate of
a learner's performance, whereas for other objectives, fewer items will
suffice.
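One way to see how test length bears on interpretation errors is a simple binomial sketch. It assumes, purely for illustration, that items are sampled independently from the objective's item pool and that a student has a fixed true proportion correct; neither the model nor the function name comes from this paper, and the 70 percent cutoff and 0.60 true proportion are hypothetical.

```python
from math import comb

def prob_classified_master(true_p, n_items, cutoff):
    """Probability that a student with true proportion `true_p` passes at
    least `cutoff` of `n_items` items (binomial model, an assumption)."""
    return sum(
        comb(n_items, k) * true_p**k * (1 - true_p) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )

# A hypothetical non-master (true proportion 0.60) judged against a
# 70 percent cutoff on tests of increasing length.
for n_items, cutoff in [(10, 7), (20, 14), (40, 28)]:
    p_false_mastery = prob_classified_master(0.60, n_items, cutoff)
    print(n_items, round(p_false_mastery, 3))
```

Under these assumptions the chance of wrongly declaring mastery shrinks as items are added, which is the trade-off against testing time described above.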

A related issue in the construction and generation of CRT items is
the degree to which the items should be sampled with respect to their
relative difficulty and possible content coverage within an objective. It
is a well-known and frequently-used principle of test construction that
even slight changes in an item can affect its difficulty. The extent to
which the items within an objective are sampled with respect to difficulty
has a direct bearing on the interpretation of the scores obtained. In
other words, if only the most difficult items are used, the phrase
"achievement of the objective" has a very different meaning than if the
items are sampled over the full range of difficulties.

Another issue concerns a CRT's instructional dependence. The
instructional dependence of a CRT refers to the extent to which it is designed
for use with a specific educational program (Baker, R.L., 1972; Skager,
1973). CRTs with a greater degree of instructional dependence have objectives
and test items that are associated with a particular curriculum or set of
educational materials and techniques. CRTs with a smaller degree of
instructional dependence, on the other hand, contain objectives and test items
that are not necessarily associated with the specific skills or content of
an educational program. However, they still may have been developed from
several educational programs and, consequently, have objectives and items
that reflect the bias inherent in these programs. Conversely, CRTs with
no instructional dependence are based on a domain of content and behaviors
that is independent of any educational program, and therefore can be used to
compare several different educational programs.

Consideration of the various issues involved in item generation for
CRTs has produced a number of different strategies for generating and
constructing items:

1. Panel of experts. A group of measurement and curriculum
experts decide which items to use based on their knowledge
of and experience in the field (Zweig, 1973).

2. Content/process matrix. Basically a variation of the classical
test construction technique, this approach involves developing
for each objective a matrix of contents and behaviors (or tasks)
to be assessed. Items are then systematically sampled within
this matrix and perhaps along a third continuum of item
difficulty as well (Wilson, 1973).

3. Systematic item generation. Basic "item forms" or specifications
are developed for each objective that define the range of item
difficulties, all the relevant contents and behaviors, and stimulus
and response characteristics of items that can be used to
assess the objective (Hively, 1970, et al., 1973; Cronbach, 1971;
Skager, 1973; Popham, 197b).

Formulating score interpretation schemes. One of the distinctive
features of a CRT is its ability to provide a means for describing what an
individual (or group) can do, knows, or feels without having to consider the
skills, knowledge, or attitude of others. Consequently, CRT scores are
reported and interpreted in terms of the level of performance obtained with
respect to the objective(s) or domain on which the CRT is based. This type
of score is very different from that used for norm-referenced tests, in which
scores are reported in terms of the performance of other individuals or
groups.



It should be noted that scores on CRT tests need not be limited to just
a CRT interpretation. Other score interpretations can also be provided to
expand upon the CRT interpretation (Klein, 1970; Cronbach, 1970; Ebel, 1972).
An example of one way of combining criterion- and norm-referenced information
is: "This school had an average score of 5 out of 10 on the objective (a
CRT interpretation) which is one standard deviation below the national
average of 7 out of 10 (a norm-referenced interpretation)." The idea of using
both types of score interpretations is not new and does not reduce the
theoretical soundness of the score interpretation (Cronbach, 1970; Klein, 1970,
1971). Combining score interpretations is particularly useful for describing
what a student can be expected to be able to do and how exceptional or
typical this performance is. Some of the different scores that can be
interpreted in a CRT sense are:

1. "Actual score." The number or percent of items "correct" on a
given objective, referring to the number of items actually
passed on the test.

2. "True score." An individual's or group's true level of performance
on an objective, referring to the portion of the total universe of
items for an objective that an individual or group could answer
correctly. (That is, if every possible item was tested, this score
is the number of items that an individual or group would pass.)

3. "Mastery" of a given objective. This refers to whether an
individual or group has achieved a pre-set criterion level of performance.
To be legitimate, the criterion level should be meaningful and
preferably empirically justifiable. For example, a criterion
level of 7 out of 10 items has meaning if systematic study has
shown that those who reach this level can actually do something
that others who have not reached this level cannot do, or if
baseline data show that the average student achieves this level.

4. Performance time. The time it takes (in class hours or calendar
days) for a student to achieve a given performance level.

5. Level readiness. The probability that the student is ready to
begin the next level of instruction (this may be based on both
the number of items correct and the pattern of answers given to
these items).

6. Item difficulty. The percentage of students who "pass" each
item; that is, the item's difficulty. (This score is given
most often when only one item is tested per objective, for
example, in the National Assessment of Educational Progress.)

7. Total objectives mastered. The number of objectives "passed" or
"mastered" by an individual or group.

8. Total individuals who passed. The number of individuals or
groups who "passed" or "mastered" each objective.
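Several of the scores listed above can be computed directly from item-level response data. The following is a minimal sketch, not a scoring procedure drawn from any reviewed test: the response data, the student and objective labels, the function names, and the cutoff of 7 out of 10 items are all hypothetical.

```python
# Hypothetical response data: responses[student][objective] is a list of
# item results (1 = correct, 0 = incorrect), ten items per objective.
responses = {
    "student_a": {"obj_1": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
                  "obj_2": [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]},
    "student_b": {"obj_1": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
                  "obj_2": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]},
    "student_c": {"obj_1": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
                  "obj_2": [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]},
}

MASTERY_CUTOFF = 7  # assumed criterion level: at least 7 of 10 items correct

def actual_score(student, objective):
    """Score 1: number of items actually passed on one objective."""
    return sum(responses[student][objective])

def has_mastered(student, objective, cutoff=MASTERY_CUTOFF):
    """Score 3: whether the pre-set criterion level was reached."""
    return actual_score(student, objective) >= cutoff

def total_objectives_mastered(student):
    """Score 7: number of objectives mastered by one student."""
    return sum(has_mastered(student, obj) for obj in responses[student])

def total_individuals_who_passed(objective):
    """Score 8: number of students who mastered one objective."""
    return sum(has_mastered(s, objective) for s in responses)

def item_difficulty(objective, item_index):
    """Score 6: proportion of students who pass one item."""
    results = [responses[s][objective][item_index] for s in responses]
    return sum(results) / len(results)

for student in responses:
    print(student, total_objectives_mastered(student))
```

The "true score" and "level readiness" entries are omitted because they require a model of the full item universe or of instruction, which response counts alone do not supply.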


Validation of CRTs

It is axiomatic that all tests and measures must be field tested before
basing decisions upon them. When construction of the objectives and test
items is complete, the CRT must be analyzed and validated. This process can
involve giving the test to students and studying their responses (response
data) or relying upon review by experts (judgmental data).

There is much ambiguity about the procedures appropriate for analyzing
CRTs. Nevertheless, there are several dimensions of item and test quality
that are considered to be relevant to CRT quality and that have associated
with them review procedures, data collection strategies, experimental
designs, and statistical indexes.

Establishing item quality. There are several commonly considered
dimensions of item quality:

1. Item-objective congruence. A test item is considered "good" if
it measures or is congruent with the objective that it is
supposed to assess. Item-objective congruence can be established
by using judgmental data. Typically, content experts are given
a variety of objectives and the items used to measure them, and
are asked to assign the items to their appropriate objective, or
to comment on the appropriateness of the item-objective
relationship.

2. Equivalence (internal consistency within objectives). An item
is considered "good" if it "behaves" like other items measuring
the same objective. The concept is similar to item-objective
congruence, but its proper use depends on response data.
Equivalence is usually measured by computing the biserial correlation
between the score on an item and the total score on all items
measuring that objective.

3. Stability (over time). An item is considered "good" if examinee
performance is consistent from one test period to the next in the
absence of any special intervention (e.g., instruction is an
intervention that can change examinee performance). Stability
involves response data and can be measured by using a phi
coefficient that correlates scores on the item from two different
occasions.

4. Sensitivity to instruction. An item can be considered "good" if
it is sensitive to instruction; that is, if the item is able to
discriminate between those who have and those who have not
benefited from instruction. This measure of item quality is usually
computed for CRTs that are linked to particular educational
programs and requires response data. Typically, examinees are
tested before and after an educational program. Items that
many examinees fail before instruction, but pass after
instruction, are considered to be sensitive to the instruction.






5. Cultural/sex bias. An item is considered "good" if there are no
systematic differences in performance across different cultural
groups or sexes. Bias can be assessed using either judgmental
or response data. If the former are used, representatives of
different cultural groups, members of each sex, and/or linguists
examine test items to determine whether vocabulary or content are
foreign or could be misinterpreted. If response data are used to
assess bias, they are analyzed (typically using ANOVA or
regression for item-cultural/sex interactions).
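Three of the response-data indexes described in this list can be sketched in a few lines. The sketch rests on several assumptions that should be read as such: the equivalence index is computed as a point-biserial (Pearson) correlation between a 0/1 item score and the total score, which is a simpler substitute for the biserial correlation named in the text; the sensitivity index is taken to be the post-instruction minus pre-instruction proportion passing; and all function names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either
    variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_stability_phi(item_time1, item_time2):
    """Stability: phi coefficient, i.e. the Pearson correlation between
    two 0/1 scorings of the same item on two occasions."""
    return pearson(item_time1, item_time2)

def item_total_correlation(item_scores, total_scores):
    """Equivalence: point-biserial correlation of a 0/1 item with the
    total score on the objective (assumed stand-in for the biserial)."""
    return pearson(item_scores, total_scores)

def sensitivity_index(item_pre, item_post):
    """Sensitivity to instruction: proportion passing after instruction
    minus proportion passing before (an assumed index)."""
    return sum(item_post) / len(item_post) - sum(item_pre) / len(item_pre)

# Hypothetical data for eight examinees on one item.
time1 = [1, 1, 0, 0, 1, 0, 1, 1]
time2 = [1, 1, 0, 0, 1, 0, 0, 1]
totals = [9, 8, 3, 4, 7, 2, 6, 8]   # total scores on the objective
pre = [0, 0, 1, 0, 0, 0, 1, 0]
post = [1, 1, 1, 0, 1, 1, 1, 1]

print(item_stability_phi(time1, time2))
print(item_total_correlation(time1, totals))
print(sensitivity_index(pre, post))
```

Item-objective congruence and judgmental bias review are not represented because they depend on expert ratings rather than on examinee responses.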


Establishing test quality. There are six dimensions commonly used to
express the quality of a CRT:

1. Test-objective congruence. Similar to item-objective congruence,
test-objective congruence assesses the extent to which the total
test or subtest measures the relevant objective. Test-objective
congruence is usually determined by using judgmental data.

2. Equivalence (internal consistency). Test equivalence measures
the homogeneity of test items for an objective, that is, how
coherently the test items assess the particular objective. This
can be measured by using split-half correlation, Kuder-Richardson
formulas, or coefficient alpha.

3a. Stability (test-retest or alternate forms). A test is stable to
the extent that examinee responses are consistent from one test
period to another or across alternate forms of a test in the
absence of any intervention.

3b. Stability (number of items per objective and number of objectives
per domain). There are two levels at which this type of stability
for a CRT can be estimated. At the first level, a determination
is made of the number of items that should be tested in order to
obtain a stable score on an objective. For this type of stability,
the assumption is made that for each objective there is a pool or
population of items with mixed difficulties that deals with the
objective, and that for any given test a sample of those items
is selected. At the second level, a determination is made of
the number of objectives that should be tested in order to
obtain a stable estimate of performance on the domain. For
this type of stability, the assumption is made that a single
score is needed that describes an individual's performance on
the domain or set of objectives. Stability can be estimated
with response data using correlation techniques and/or Bayesian
models (Novick and Lewis, 1974).

4. Sensitivity to instruction. Sensitivity to instruction refers
to a test's ability to discriminate between those who have and
those who have not benefited from instruction. This type of
measure of test quality is usually obtained for CRTs that are
linked to a specific educational program. It can be measured
using response data by comparing test performance before and
after instruction or by comparing scores of those who have and
those who have not received instruction.



19 

21 



CuUural/sex bias. Test bias refers to the existence of syste- 
matic differences in test performance across cultural/sex 
groups. This can be measured by ANOVA or regression techniques 
using response data or by expert review using judgmental data. 

Criterion validity. Criterion validity establishes the meaning- 
fulness of the criterion in terms of which CRT scores are inter- 
preted. Establishing criterion validity is either a one-step or 
a two-step process. 

* 

Step 1: The first step involves assessing the meaningfulness
of the domain: that objectives have been selected and
organized to be in themselves educationally significant,
and that test items have been systematically generated
to cover the objectives. Step 1 criterion validity is
usually established by having experts review the objectives
and test items to determine the extent to which
they were developed in conformance with pre-specified
procedures, and to which they cover the domain in a
comprehensive and meaningful manner.

Step 1 must be completed for all CRTs and, in some cases,
is sufficient for establishing criterion validity. One
example of a CRT that only requires Step 1 criterion
validity is a CRT that is based on objectives that are




narrowly defined and "operationally" stated in such detail
that generating test items only requires transposing the
objectives into question form. CRT score interpretations
for objectives with these characteristics are meaningful
because the objectives describe skills that can be measured
directly by test items. A second case is when the CRT's
objectives are linked to a curriculum and its scores are
intended for and interpreted by teachers and curriculum
experts. CRT score interpretations in terms of these
types of objectives are meaningful because the skills and
knowledge being measured are those taught in classrooms
using a specific curriculum. A third case in which Step
1 validity is sufficient is when comparative data are
provided, or when the CRT score interpretation is supplemented
by a normative interpretation, e.g., the class
correctly answered an average of 7 out of 10 items whereas
in the district the average class achieved 5 out of 10.

In Step 2, criterion validity is established through
empirical means, and involves determining whether
examinees who perform well on the test have really
achieved the educational objective. Step 2 criterion
validity can be measured by comparing scores obtained
on a CRT by individuals who, in advance of taking the
CRT and using independent criteria, were judged to
possess or not possess the skills that the objective




is intended to measure. To the extent that the CRT
discriminates between these two groups of individuals,
the CRT has criterion validity.*
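The known-groups comparison described for Step 2 can be sketched as the proportion of examinees whose CRT classification agrees with the independent judgment. The function name, the cutoff, and the scores below are hypothetical, and the paper does not specify a particular index of discrimination.

```python
def known_groups_agreement(master_scores, nonmaster_scores, cutoff):
    """Proportion of examinees whose CRT result matches the independent judgment.

    master_scores: CRT scores of examinees judged in advance to have the skill.
    nonmaster_scores: CRT scores of examinees judged not to have the skill.
    cutoff: minimum CRT score counted as mastery (an assumed value).
    """
    total = len(master_scores) + len(nonmaster_scores)
    if total == 0:
        raise ValueError("at least one score is required")
    # Judged masters who reach the cutoff, plus judged non-masters who do not.
    agree_masters = sum(1 for s in master_scores if s >= cutoff)
    agree_nonmasters = sum(1 for s in nonmaster_scores if s < cutoff)
    return (agree_masters + agree_nonmasters) / total


# Hypothetical scores on a 10-item objective, with mastery set at 7 correct.
judged_masters = [9, 8, 7, 10]
judged_nonmasters = [3, 5, 8, 2]
agreement = known_groups_agreement(judged_masters, judged_nonmasters, cutoff=7)
print(agreement)
```

Agreement well above what the group sizes alone would produce by chance is consistent with the CRT discriminating between the two groups.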

By establishing Step 2 criterion validity, the relationship
between test items and the objectives they are
supposed to measure is confirmed. Step 2 criterion
validity permits assertions about mastery of the individual
objectives that comprise a domain and about more
complex behaviors whose component parts are defined by
the domain. For example, if a reading test has Step 2
criterion validity, then it becomes possible to make
statements about mastery of objectives, like: "John
Doe can identify the title sentence in a paragraph,"
and "John Doe can understand main ideas in a reading
passage," as well as statements about mastery of a
domain, like: "John Doe can read well enough to comprehend
daily newspapers or best-selling novels."

Step 2 criterion validity is particularly useful when
objectives are not narrowly defined, only a CRT interpretation
is provided, and it may be difficult to
assume that achievement of the items necessarily reflects
achievement of the larger objective or domain.

*Step 2 criterion validity is similar to construct validity, but an
objective or a domain, rather than a psychological state, is the construct.




Establishing classical reliability and validity. There has been
considerable debate over the appropriateness of "classical" indexes of
reliability and validity to criterion-referenced tests. Some psychometricians
have argued that since CRT items are selected to measure achievement
of specific educational objectives and not to discriminate between students,
scores on CRTs can lack variation. This could arise in the following
situation: Before instruction, none of the students have mastered the
objectives, and they might all receive a score of zero on the
criterion-referenced pretest, whereas after instruction, they might all receive very
high scores on the criterion-referenced posttest. A lack of variation in
student scores, it is claimed, would cause the traditional indexes of
reliability and validity (that are based on variance) to be inappropriate (Popham
and Husek, 1969).

Others have argued that when CRTs are administered to a heterogeneous
sample representing differing degrees of competence and receiving differing
instruction on the objective, there will be sufficient variation in test
performance to apply the classical statistical formulas (Klein, 1970; Harris,
1973). This latter stance is becoming the accepted view, and it is now held
that the classical indexes (e.g., stability, equivalence) can be estimated
for CRTs using a heterogeneous population.
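The dependence of classical indexes on score variation can be seen with one variance-based coefficient, Kuder-Richardson formula 20 (KR-20), chosen here only as a familiar example; the paper does not single it out. In the sketch below, the function name and response matrices are hypothetical. The coefficient is undefined when every examinee earns the same total score and becomes computable for a heterogeneous sample.

```python
def kr20(responses):
    """KR-20 internal consistency for 0/1 item responses.

    responses: list of examinee rows, each a list of item scores (0 or 1).
    Returns None when total scores do not vary, since the index is then undefined.
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 examinees and 2 items")
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance
    if var_total == 0:
        return None
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Homogeneous pretest: nobody has mastered the objective yet.
pretest = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
# Heterogeneous sample: examinees at differing levels of competence.
mixed = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]

print(kr20(pretest))
print(kr20(mixed))
```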

CRTs' Theoretical Appropriateness for Evaluation Purposes

Relying on the preceding theoretical discussion of the development and
validation of CRTs, it is possible to ask:

Based on theoretical considerations alone, are CRTs appropriate
to measure achievement for large-scale effectiveness evaluation?

The answer to this question is yes. An effectiveness evaluation requires
instruments that are reliable and valid and provide meaningful
scores that can be used to make decisions about educational policy. In
theory, there is an orderly set of developmental and validation procedures
which, if followed properly, produce CRTs that are based on well-defined
sets of objectives and that can provide meaningful and useful score
interpretations. Thus, from a theoretical perspective, CRTs are appropriate
and desirable for measuring achievement in effectiveness evaluations.
However, there are important caveats attached to this conclusion.

First, there are persons who simply reject the notion of
criterion-referenced testing, and with it, the meaningfulness
of any CRT score interpretations. If an evaluation
is being commissioned by individuals who share this
view, then CRTs should not be used since the resulting
information, although theoretically sound, is likely to be
ignored.

Second, as is the case with norm-referenced tests, not all
CRTs provide the same type of score interpretation. Some
CRTs report and interpret scores in terms of the number of
items passed per objective, and many educators and policymakers
find this type of score interpretation by itself to






be inadequate for most effectiveness evaluation purposes.
However, rejection of this type of score interpretation is
not equivalent to rejection of the notion of CRTs since there
is no reason why CRT scores cannot be supplemented by comparative
data.


Review of Currently Available CRTs 

In this section, currently available CRTs are reviewed to determine
if they are technically sound, and if they have been designed so that
they can be easily used for a large-scale effectiveness evaluation. To
do this, a list of review criteria was generated and copies of currently
available CRTs were obtained from publishers. The CRTs were evaluated
using the review criteria. Based on the results of the review, the
practical appropriateness of CRTs for evaluation purposes was discussed.

Generating Review Criteria

To structure the review of available CRTs, a set of criteria was
generated. The criteria reflect the characteristics generally accepted
as being necessary and appropriate for a large-scale effectiveness
evaluation. In order to obtain the criteria, several sources were consulted,
including a review of the literature, requests for proposals issued by
state and federal agencies involving large-scale evaluations, and criteria
already developed and used for reviewing achievement tests. The final set






of criteria were critiqued and approved by senior researchers and adminis- 
trators on a major evaluation study. 

Obtaining CRTs

A list of publishers of educational tests was compiled using test
review books (Buros, 1965, 1972; Hoepfner et al., 1970, 1971, 1974),
personal contacts, and library sources (Klein and Kosecoff, 1973). It should
be noted that publishers on the list were not necessarily known as marketers
of CRTs because it was not always possible to predict in advance who
published CRTs and who did not, and because it was considered important to
include as many publishers as possible in the review.

A letter was sent to each publisher that requested the following
information about any criterion-referenced math or reading tests that they
might have available.

1. Detailed descriptions of the test battery at each available
grade level (e.g., # objectives, # items, subject matter
covered...)

2. Sample tests for reading and math at each available grade level

3. Lists of objectives or domains for reading and math at each
available grade level

4. Directions for administering and scoring reading and math
tests at each available grade level






5. Any technical manuals, field test reports, expert reviews, or
test analysis information

6. Information about special features like scoring services or
cassette-recorded directions

7. Cost information

8. Name and title of person to be contacted for additional information 

When publishers' responses were received, they were sorted into three
piles: a "totally irrelevant" pile (e.g., tests purporting to measure
science, math, handwriting, and aptitude for medical school); a "possibly
interesting, but lacking sufficient information for review" pile (e.g.,
brochures without copies of tests or test manuals; tests of verbal ability,
but not reading; responses from individual researchers who had tests that
were not ready for publication); and a "potential CRTs" pile (e.g., any
publisher who claimed to have a CRT in reading and/or math and who provided,
at the minimum, copies of the test(s) and test manuals). Only the 28 CRTs
in the third pile were reviewed.

Each CRT was independently reviewed twice using the set of criteria
generated for this purpose, and discrepancies were resolved by the two
reviewers. Any remaining questions, that is, those usually resulting from
unclear or insufficient information from the publishers, were followed up
with a phone call to the publisher.






Explaining Review Criteria 

There were nineteen criteria against which CRTs were reviewed. (A
copy of the forms used by reviewers can be found in Appendix A.) For
this review, reading and language arts were considered to be one
subject area and mathematics a second subject area. All subtests or tests
of individual objectives at the same level were grouped together and
considered as a single reading or math test. In addition, the criteria were
especially designed in order to permit cross-grade level and longitudinal
comparisons that typify large-scale evaluations.*

1. Coverage of specific skills. A test must (in the reviewer's
opinion) cover skills in reading (language arts) and/or mathematics.
Examples of basic skills are reading comprehension,
spelling, arithmetic, and telling time as compared to tangential
skills like using the library or computation with a slide rule.

2. Grade-level coverage. Forms of the test must be available for
grades 1 through 9. (This criterion makes possible comparisons
across grade levels as well as longitudinal comparisons.)

3. Overlap of objectives across grade levels. In the reviewer's
opinion, some or all of the test's objectives must be measured
at each grade level in order to make comparisons across grade
levels or over time in terms of common educational objectives

*This investigation focused on CRTs that were developed for grades
1-9 since most currently available CRTs have been developed for those
grades.



or skills. For this criterion, objectives or test items
at different grade levels need not be worded identically.
For example, a test item at the second-grade level might have a
student read a sentence and select from a series of four pictures
the one that best depicts the sentence, while a parallel but more
complex test item at the ninth-grade level might have a student
read a paragraph and select one out of four sentences that best
summarizes the paragraph. For this review, the test need not
provide a formal means of identifying those test items or objectives
that are measured at different grade levels.

4. Number of test forms per grade level. Due to constraints related
to test administration and the time available for testing, there
should be a limited number of test forms at each grade level.
Just one test per grade level is preferred in order to avoid
problems with reliability that can arise when several test forms
are combined.

5. Complete directions for test administration. A test should provide
(in the opinion of the reviewers) thorough and clear instructions
for both the examiner and examinee. Directions concerning distributing
tests, demonstrating sample questions, and test administration
should be provided in a detailed and easy-to-read form.






6. Special equipment needed for test administration. Test administration
should not involve any special equipment (like cassettes
or visual aids) aside from pencils and scratch paper.

7. Time for testing. A test (reading or math) should be designed
to be completed within a given class period. This usually involves
a maximum of 40-60 minutes.

8. Group testing. A test must be designed for group administration.

9. Item-objective match. Each test item should be coded to an
objective (or the educational tasks and purposes the test claims
to measure).

10. Objective coverage. There should be (in the opinion of the reviewers)
a sufficient number of items to adequately measure each objective.
The number of items per objective should vary as a function of how
broadly or narrowly an objective is stated and its level of
difficulty.

11. Objective/subjective scoring. A test must use an objective scoring
procedure.

12. Machine scoring options. The test must be available in or adaptable
to a machine-scoring format.

13. Score interpretation scheme. A test must employ a criterion-referenced
score interpretation scheme. Tests using CRT interpretations in addition
to other types of score interpretation schemes were also acceptable
for this criterion.




14. Reusable materials. Due to monetary constraints, it is preferable
that test booklets and test manuals be reusable.

15. Curriculum dependence. A test should not be based on the objectives
of any particular curriculum or educational program.

16. Costs of tests per pupil. The costs of testing pupils must be
affordable for a large-scale study.

17. Formal field test. A test should provide documentation of field
test activities. It is preferable that the field-test participants
be nationally and geographically representative, be a probability
sample, and include sufficient numbers of minority persons to
estimate bias.

18. Information on item quality. Information should be provided, based
either on judgmental or response data, about item stability, sensitivity
to instruction, sex/cultural bias, item-objective congruence,
and equivalence.

19. Information on test quality. Information should be provided on test
quality, based either on judgmental or response data, to include
information about internal consistency, test stability, test-objective
congruence, sex/cultural bias, sensitivity to instruction,
and criterion validity.






Results of the Review

In this section, the results for the twenty-eight tests reviewed for
this study are presented. Each individual reading or mathematics test is
identified by a numerical code. The codes are necessary because the
publishers submitted their materials voluntarily and did not formally
consent to a published review. Further, because many of the 28 CRTs were
intended for classrooms and not certification evaluation purposes, the
review conducted for this investigation tended to make some CRTs look
less excellent than they would have if they had been reviewed from another
perspective. The names of the publishers whose tests were reviewed can be
found in the Appendix.

1. Coverage of specific skills

Of the twenty-eight tests reviewed, 15 were designed to assess
only reading skills, and 13 were designed to assess only
mathematics skills. All twenty-eight tests reviewed focused
on measuring basic skills in reading and/or mathematics, rather
than on tangential skills, and thus met the criterion.

2. Grade-level coverage 

Nine tests were available for grades K-9, and thus met the 
criterion. The remainder varied from CRTs available for 
grades K-2 to those available for grades K-t5. 






3. Overlap of objectives across grade levels

Twelve tests appeared to measure the same objectives at all
grade levels. Sixteen tests appeared to have some overlapping
objectives which were measured at most, but not all,
grade levels, depending on "the appropriateness of the
objective" and its level of specificity. It should be noted
that to make common objectives, test publishers frequently
used broadly stated objectives or skill categories which they
then "translated" into tasks and skills of varying complexity
for different grade levels.

4. Number of test forms per grade level

Some CRTs had only one test form per grade level and others
had as many as 31. Usually those CRTs that offered a limited
number of test forms per grade level would include several
objectives on a single test form, while those featuring more
test forms per grade level would assess one or only a few
objectives per form. Three tests did not set limits on the
number of tests that could be created from their bank of
objectives and items.

5. Complete directions for test administration

Twenty-seven of the tests met the criterion by providing adequate
directions both to the examiner and examinee for test
administration. One test provided for review no information
about administration.




6. Special equipment needed for test administration

Twenty-six tests required no special equipment for test
administration and, therefore, met the criterion. Two tests
required the use of tape recorders or cassettes, and one provided
no information. It should be pointed out that many of
the 26 tests were specifically designed for use with special
equipment and consider its omission to be relatively less
desirable.

7. Time for testing

Only two tests met this criterion. Most tests (24) left
time for testing open, but from their length appeared to the
reviewers to take more than one hour of testing time. One
CRT had no information about the time needed for testing.

8. Group testing

Twenty-five tests could be administered to groups and,
therefore, met the criterion. Two tests were designed for
individual administration only, and one did not provide
this information.

9. Item-objective match

Twenty-six tests had each item coded to an objective and
one CRT did not provide this information.






10. Objective coverage

The items tested for each objective ranged from 1 to
150 across the 28 tests. (It should be noted that the
CRT with 150 items per objective was based on a computerized
item bank from which tests of any length could be
generated.)

11. Objective/subjective scoring

Twenty-seven tests employed an objective scoring
technique, meeting this criterion. One test employed a
subjective technique, and one other CRT did not provide
this information.

12. Machine scoring option

Eighteen tests met the criterion for machine scoring. Nine
CRTs were hand-scorable only, and one CRT did not provide
this information.

13. Score interpretation scheme

Twenty-seven tests met the criterion by using some type of
criterion-referenced score interpretation scheme. Overwhelmingly,
the scheme was expressed as an arbitrary mastery/
non-mastery score or the number of items correct on a given
objective. Of these same 27 tests, 7 also employed norm-referenced
interpretations. One test did not describe its
score interpretation scheme.





14. Reusable materials

Twenty-four tests were designed so that at least some portion
of the materials could be reused. These usually were the test
booklets, when separate answer sheets were provided, and the
teacher's and examiner's manuals. Three CRTs had no reusable
materials, and one did not provide this information.

15. Curriculum dependence

Twenty-two tests appeared to have total independence from a
particular curriculum or instructional program. Six other
tests also appeared to be rather general and independent,
although they claimed to be based in varying degrees on a
review of what is currently being taught in today's schools.

16. Cost of tests per pupil

Based on a purchase of tests in reading or math at the third-grade
level, costs ranged from about five cents per student
to $6.31 per student. One test had to be implemented at the
district level and cost $7500.00. Most tests are sold in
sets of 30 to 35 test booklets. To compute costs, it was
assumed that an individual student counted as 1/30 to 1/35 of
the total.
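The per-pupil cost assumption amounts to dividing the price of a set by the number of booklets in it. The sketch below illustrates that arithmetic; the function name and the set price are hypothetical and are not taken from the review.

```python
def cost_per_pupil(set_price, booklets_per_set):
    """Per-pupil cost when one student counts as 1/booklets_per_set of a set."""
    if booklets_per_set <= 0:
        raise ValueError("a set must contain at least one booklet")
    return set_price / booklets_per_set


# Hypothetical set price; set sizes of 30 and 35 follow the range given above.
hypothetical_set_price = 52.50
low = cost_per_pupil(hypothetical_set_price, 35)
high = cost_per_pupil(hypothetical_set_price, 30)
print(f"${low:.2f} to ${high:.2f} per pupil")
```

A test priced for district-wide implementation rather than by the set does not fit this calculation and would have to be costed separately.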






17. Formal field test 

Eight tests provided documentation concerning field test
activities. However, the information provided was remarkably
sparse, with several exceptions. Those who did conduct field
tests usually attempted to get some sort of geographic and
national representation. Fifteen tests claimed to have been
field tested but provided no supporting documentation, and
five additional tests provided no information at all about
field tests.

18. Information on item quality

Twelve tests reported having conducted item quality
studies based on response data and/or expert review.
Of these, attention typically was paid to item-objective
congruence, item stability or equivalence, and sensitivity
to instruction. Eight tests reported having some type of
review but declined to state the kinds or extent of their
studies. Eight other systems did not provide any information
at all.






19. Information on test quality

Thirteen tests reported having conducted test quality studies
based on response data and/or expert review. Of these, internal
consistency, stability, test-objective congruence, sensitivity
to instruction, and criterion validity (Step 1) were
most frequently attended to. Seven other systems claimed to
have performed test quality studies, but provided no supporting
documentation. Eight additional systems provided no information
at all.

Figure 2 summarizes the results of the review for each test.



[Figure 2. Summary of the results of the review for each of the 28 tests
(identified by numerical codes 001-028) against the review criteria. The
cell entries of the figure are not recoverable in this copy.]

Key to Figure 2:
P = Pass
F = Fail
S = Some
A = Always
N = None
(The key also defines symbols for "No information" and "Open"; those
symbols are not legible in this copy.)



CRTs' Practical Appropriateness for Effectiveness Evaluation Purposes

Relying on the preceding discussion of the characteristics of 
currently available CRTs, it is possible to ask: 

Based on practical considerations alone, are CRTs
appropriate for large-scale effectiveness evaluations?

The answer to this question is no. From the review, it is clear that
although no CRT met all the criteria, there are several CRTs that are
potentially feasible for effectiveness evaluation purposes. However, using one
of these tests would involve considerable effort to adjust it for an
evaluation situation. Specifically, the review uncovered some practical problems
that diminish currently available CRTs' suitability for an effectiveness
evaluation. They are:

1. Many learning objectives. Most of the CRTs reviewed had a large
number of very specific learning objectives that were associated
with very small units of instruction, like one to five class
lessons. The reason for the use of many, narrowly defined objectives
can probably be traced to CRTs' original use by teachers as
one of their regular instructional aids in individualizing and
evaluating instruction. Nevertheless, an effectiveness evaluation
of the impact of just one year of instruction at one grade level,
using such a CRT, would generate information about an enormous
number of objectives, thus complicating the management, analysis,
and reporting of data.




2. Numerous test forms. Many currently available CRTs provide at
each grade level separate test forms, each measuring just one or
a few different objectives. For example, of the 28 tests reviewed,
some had up to 31 separate test forms per grade level. The appearance
of many test forms also probably reflects the original intention
to use CRTs as classroom aids. In terms of an effectiveness
evaluation, the logistics of administering a number of distinct
tests complicates information collection activities and increases
the chances of making errors as well as the costs of conducting
the evaluation.

3. Time required for testing. Most available CRTs take more than an
hour of class time. For example, the review found that 23 of the
28 publishers claimed that their tests were untimed and thus left
pacing to the discretion of the examiner; however, based on the
number of test items, it is clear that one hour of test time
is insufficient. In terms of the schedules of most evaluation
studies, one class period of testing is the maximum time that can
usually be devoted to a CRT.

It should be noted that some of the test publishers, recogni- 
zing time constraints, offered CRTs that had just one item per 
objective. However, this is not a satisfactory solution since 
reduction in the number of items will almost invariably bring 
with it a diminution in the test's ability to measure with 
precision each of the objectives. 






4. Matching CRTs' objectives to instruction. Using CRTs in 
effectiveness evaluations that involve more than one educational 
program means determining relationships between the CRTs' 
objectives and the programs' so that achievement can be measured in 
terms of the objectives emphasized in instruction and exemplary 
programs can be identified. However, obtaining this information 
is costly and complicated. Teachers can be asked, for example, 
to rate the CRTs' objectives in terms of their relevance to 
classroom instruction, but teacher ratings can be unreliable. 
Instructional experts can be asked to analyze textbooks and 
curriculum guides; however, they cannot know for certain how these 
materials are being used in the classroom. 

Another problem closely associated with that of relating CRT and 
instructional objectives concerns which objectives to test. 
Each student or classroom can be tested on just those objectives 
that are derived from the curriculum being used, or can 
be tested on a sample of objectives, some of which may be relevant 
to the curriculum while others are not. Depending 
upon the choice, the resulting evaluation information can be 
limited in its ability to be used in making comparisons or can 
require considerable manipulation before interpretations can 
be made. 






5. Identifying common objectives. A fifth problem with using CRTs 
in effectiveness evaluation studies is that the same objectives 
are not always measured at all grade levels, or, if they are, 
there is no system for identifying common objectives. Although 
the skills and content associated with an objective generally 
become more complex with increasing grade levels, it is necessary, 
in order to make comparisons over time or across grades, to identify 
skills or objectives that are related in terms of a conceptual 
framework or general content area. For example, in the fourth 
grade, a punctuation objective might focus on beginning sentences 
with capital letters and ending them with periods, while in the 
ninth grade, a punctuation objective might focus on the proper use 
of semicolons as alternatives to periods. Although both these 
objectives deal with the same skill area, grammar, unless they 
were formally referenced to that general skill area, the evaluator 
is faced with the responsibility of making this instructional 
type of decision, one that is ordinarily not part of his/her 
area of expertise. 

6. Validating CRTs. The procedures used to validate CRTs are not very 
sophisticated, and field test results are not reported in any detail. 
When compared with the highly structured field tests conducted 
for norm-referenced tests, most CRTs are deficient with respect 
to the sample's size and representativeness, and/or the amount or 
precision of data presented in technical reports. It must be 
noted that test publishers have probably been reluctant to devote 
time and money to field testing because test theorists have not 
been able to provide them with an agreed-upon set of procedures 
for analyzing and reporting field test data. Assigning blame, 
however, is not the issue, since the fact remains that a paucity 
of data is provided concerning the technical quality of tests and 
test items. 

7. CRT scores. Most CRTs report scores in one of two ways: either 
as the number of items correctly answered for each objective, or 
sometimes as mastery or non-mastery scores, where "mastery" means 
correctly answering an arbitrarily selected number of items per 
objective. These types of score interpretation are accepted by 
theorists as a legitimate way of expressing CRT test scores, and 
they may have meaning for teachers who know their curriculum. 
However, for effectiveness evaluation purposes, these types of 
interpretations alone are inadequate because they provide 
insufficient information for decision making and lose meaning 
outside the classroom. 

8. Financial considerations. A final practical problem with using 
currently available CRTs for effectiveness evaluation purposes is 
that most are costly. This probably reflects the effort it takes 
to define domains and to produce the special features offered by 
CRTs, like referencing the objectives to various school curriculums 
and providing many short test forms that can be used efficiently 
for classroom instruction purposes. 
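
The two kinds of score described under "CRT scores" above (number of 
items correct per objective, and a mastery or non-mastery label based on 
a chosen cutoff) can be illustrated with a minimal Python sketch. The 
objective names, item responses, and the cutoff of 3 items out of 4 are 
hypothetical; they are not taken from any of the 28 tests reviewed. 

```python
from typing import Dict, List, Tuple


def score_objectives(
    responses: Dict[str, List[bool]],
    cutoffs: Dict[str, int],
) -> Dict[str, Tuple[int, bool]]:
    """Return (number correct, mastery flag) for each objective."""
    results = {}
    for objective, items in responses.items():
        number_correct = sum(items)  # each True counts as one correct item
        mastered = number_correct >= cutoffs[objective]
        results[objective] = (number_correct, mastered)
    return results


# Hypothetical record for one student: three objectives, four items each.
student = {
    "capitalization": [True, True, True, False],
    "end punctuation": [True, False, False, True],
    "semicolons": [False, False, True, False],
}
# Hypothetical, arbitrarily selected cutoff: 3 of 4 items per objective.
cutoffs = {name: 3 for name in student}

scores = score_objectives(student, cutoffs)
for name, (correct, mastered) in scores.items():
    label = "mastery" if mastered else "non-mastery"
    print(f"{name}: {correct} correct, {label}")
```

Both outputs are tied to a single objective and a single cutoff. Neither 
says how the result compares with other groups or programs, which is the 
limitation for effectiveness evaluation noted above. 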






Conclusions 

In previous sections, theoretical and practical characteristics of 
CRTs were examined. In this section, the results of those examinations 
are synthesized in order to determine the feasibility of using criterion- 
referenced tests to measure achievement in an effectiveness evaluation. 

The Feasibility of Using CRTs in an Effectiveness Evaluation Context 

There is no currently available CRT that is feasible for use in 
large-scale effectiveness evaluations. This conclusion is based on 
practical, not theoretical, considerations. One major reason for the 
likely inappropriateness of available CRTs is that many of them have 
been designed for classroom, and not evaluation, purposes and, 
consequently, are characterized by numerous, narrowly defined 
objectives, each measured on a separate test form. In the context of 
an effectiveness evaluation, these CRTs produce unwieldy amounts of 
information, 
require too much time for testing, and create logistical problems for 
test administrators. 

A second major practical failing of currently available CRTs is 
that field tests are either not documented or are performed inadequately. 
As a result, the reliability and validity of these CRTs are simply not 
known, and it is inappropriate to provide decision makers with 
information of unconfirmed quality. 






A third major failing of available CRTs is that the score 
interpretations given are not as meaningful as can be expected. Most are 
presented as numbers of items passed, without Step 2 criterion validity 
information or comparative data as supplements. Other practical findings 
include the costs of CRTs and the absence of mechanisms for tracking the 
same skills or objectives across grade levels. 

A CRT that is feasible to use to measure achievement in an effective- 
ness evaluation should be based on a limited set of objectives that repre- 
sent essential competencies and basic skills, be proven reliable and valid, 
and be able to provide scores that are meaningful and useful. 
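
The tension between a limited testing time and proven reliability can be 
illustrated with two standard psychometric formulas, neither of which is 
used in this paper. The sketch below assumes a simple binomial model 
(independent items of equal difficulty) for the standard error of a 
proportion-correct score, and parallel items for the Spearman-Brown 
projection. The values 0.7 and 0.80 and both function names are 
hypothetical. 

```python
import math


def proportion_standard_error(p_true: float, n_items: int) -> float:
    """Standard error of an observed proportion-correct score.

    Assumes a binomial model: n_items independent items, each answered
    correctly with probability p_true.
    """
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must lie between 0 and 1")
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return math.sqrt(p_true * (1.0 - p_true) / n_items)


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Hypothetical student who would answer 70% of an objective's items correctly.
for n_items in (1, 5, 10):
    se = proportion_standard_error(0.7, n_items)
    print(f"{n_items:2d} item(s) per objective: standard error = {se:.3f}")

# Hypothetical 5-item objective score with reliability 0.80, cut to 1 item.
projected = spearman_brown(0.80, 1 / 5)
print(f"projected reliability with 1 item: {projected:.2f}")
```

Under these assumptions the standard error shrinks only with the square 
root of the number of items, so a test with few items per objective 
trades precision for testing time. 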






References 



Alkin, M.C., Kosecoff, J., Fitzgibbon, C., and Seligman, R. Evaluation 
and Decision Making: The Title VII Experience. CSE Monograph No. 4. 
Center for the Study of Evaluation, University of California, 
Los Angeles, 1974. 

Baker, R.L. Measurement considerations in instruction product 
development. Paper presented at Conference on Problems in Objectives 
Based Measurement, Center for the Study of Evaluation, University of 
California, 1972. 

Bormuth, J.R. On the theory of achievement test items. Chicago: 
University of Chicago Press, 1970. 

Buros, O.K. (Ed.). The Mental Measurements Yearbook. Highland Park, 
New Jersey: Gryphon Press, 1965, 1972. 

Cronbach, L.J. Essentials of Psychological Testing (3rd ed.). New York: 
Harper, 1970. 

Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.), Educational 
Measurement (2nd ed.). Washington, D.C.: American Council on Education, 
1971. 

Cronbach, L.J., & Suppes, P. (Eds.). Disciplined Inquiry for Education. 
National Academy of Education, 1969. 

Davis, F.B., and Diamond, J.J. The Preparation of Criterion-Referenced 
Tests. CSE Monograph No. 3. Center for the Study of Evaluation, 
University of California, Los Angeles, 1974. 

Ebel, R.L. Evaluation and educational objectives: Behavioral and 
otherwise. Paper presented at the Convention of the American 
Psychological Association, Honolulu, Hawaii, 1972. 

Fink, A., and Kosecoff, J. Evaluation Primer. Book in preparation, 1976. 






Glaser, R. Instructional technology and the measurement of learning outcomes: 
Some questions. American Psychologist, 1963, 18, 519-521. 

Glaser, R., & Nitko, A. Measurement in learning and instruction. In R.L. 
Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.: 
American Council on Education, 1971, pp. 625-670. 

Harris, C. Comments on problems of objectives based measurement. Paper 
presented at Annual AERA Meeting, New Orleans, 1973. 

Harris, M.L., & Stewart, D.M. Application of classical strategies to 
criterion-referenced test construction. Paper presented at the annual 
meeting of the American Educational Research Association, New York, 1971. 

Hively, W. Introduction to domain referenced achievement testing. 
Symposium presentation, AERA, Minnesota, 1970. 

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain 
referenced curriculum evaluation: A technical handbook and a case study 
from the MINNEMAST project. CSE Monograph Series in Evaluation, Volume 1. 
Center for the Study of Evaluation, University of California, Los Angeles, 
1973. 

Hoepfner, R. et al. CSE Elementary School Test Evaluations. Los Angeles: 
Center for the Study of Evaluation, UCLA Graduate School of Education, 1970. 

Hoepfner, R. et al. CSE-ECRC Preschool/Kindergarten Test Evaluations. 
Los Angeles: Center for the Study of Evaluation and Early Childhood Research 
Center, UCLA Graduate School of Education, 1971. 

Hoepfner, R. CSE Secondary School Test Evaluation: Grades 7 & 8. Los Angeles: 
Center for the Study of Evaluation, UCLA Graduate School of Education, 1974. 

Hoepfner, R., 1975. A Theological Examination of Criterion-Referenced Measures 
Based on Alkin's MEAN Test Evaluation Scheme: A Photographic Essay, pp. 
21-109. Life, October. 

Hofstadter, R. Anti-Intellectualism in American Life. Vintage Books, 1963. 







Keesling, J.W. Identification of Differing Intended Outcomes and their 
Implications for Evaluation. Paper presented at the annual meeting of the 
American Educational Research Association, Washington, D.C., 1975. 

Klein, S.P. Evaluating tests in terms of the information they provide. 
Evaluation Comment, 1970, 2 (2), 1-6. ED 045 699. 

Klein, S.P. Evaluating Tests in Terms of the Information They Provide. 
Evaluation Comment, 1971, 2 (2). 

Klein, S.P. An evaluation of New Mexico's educational priorities. Paper 
presented at Western Psychological Association, Portland, 1972. 
TM 002 735. (ED number not yet available.) 

Klein, S., Fenstermacher, G., and Alkin, M. "The Center's Changing Evaluation 
Model," Evaluation Comment, 1971, 2 (4). 

Klein, S.P., & Kosecoff, J.B. Issues and procedures in the development of 
criterion-referenced tests. ERIC/TM Report 26. Princeton, N.J.: ERIC 
Clearinghouse on Tests, Measurement and Evaluation, 1973. 

Mager, R.F. Preparing instructional objectives. San Francisco: Fearon, 
1962. 

Nitko, A.J. Problems in the development of criterion referenced tests. 
Paper presented at Annual AERA Meeting, New Orleans, 1973. 

Novick, M.R., and Lewis, C. Prescribing Test Length for 
Criterion-Referenced Measurement. CSE Monograph No. 3. Center for the Study 
of Evaluation, University of California at Los Angeles, 1974. 

Popham, W.J. Educational Evaluation. New Jersey: Prentice-Hall, 1975. 

Popham, W.J., & Husek, T.R. Implications of criterion referenced 
measurement. Journal of Educational Measurement, 1969, 6 (1), 1-9. 






Skager, R. Generating criterion referenced tests from objectives based 
assessment systems: Unsolved problems in test development, assembly 
and interpretation. Paper presented at Annual AERA Meeting, New Orleans, 
1973. 

Skager, R. Critical Differentiating Characteristics for Tests of Educational 
Achievement. Paper presented at the annual meeting of the American 
Educational Research Association, Washington, D.C., 1975. 

Wilson, H.A. A humanistic approach to criterion referenced testing. Paper 
presented at Annual AERA Meeting, New Orleans, 1973. 

Wilson, H.A. A Judgmental Approach to Criterion-Referenced Testing. 
CSE Monograph No. 3. Center for the Study of Evaluation, University of 
California, Los Angeles, 1974. 



Zweig, R., & Associates. Personal communication, March 15, 1973. 






APPENDIX A 






TESTS REVIEWED 

Name of System                              Publisher 

Fountain Valley Teacher Support             Richard Zweig, Association, Inc. 
  System-Reading 
Fountain Valley Teacher Support             Richard Zweig, Association, Inc. 
  System-Mathematics 
Prescriptive Reading Inventory              CTB/McGraw-Hill 
Diagnostic Mathematics Inventory            CTB/McGraw-Hill 
Comprehensive Tests of Basic Skills         CTB/McGraw-Hill 
  Form S (CTBS/S)-Reading 
Comprehensive Tests of Basic Skills         CTB/McGraw-Hill 
  Form S (CTBS/S)-Mathematics 
ORBIT (Objectives-Referenced Bank of        CTB/McGraw-Hill 
  Items and Tests) 
Skills Monitoring System-Reading            Harcourt, Brace, Jovanovich, Inc. 
                                              (not yet available) 
1973 Stanford Reading Tests                 Harcourt, Brace, Jovanovich, Inc. 
1973 Stanford Mathematics Tests             Harcourt, Brace, Jovanovich, Inc. 
Individualized Criterion-Referenced         Educational Development Corporation 
  Testing-Reading 
Individualized Criterion-Referenced         Educational Development Corporation 
  Testing-Mathematics 
Woodcock Reading Mastery Tests Form A       American Guidance Service 
Key Math (Diagnostic Arithmetic Test)       American Guidance Service 
Mastery: An Evaluation Tool,                Science Research Associates 
  SOBAR, Reading 
Mastery: An Evaluation Tool,                Science Research Associates 
  Mathematics 
Individual Pupil Monitoring                 Houghton-Mifflin 
  Systems-Reading 
Individual Pupil Monitoring                 Houghton-Mifflin 
  Systems-Mathematics 
Comprehensive Achievement Monitoring        National Evaluation Systems 
  (CAM) Maintenance Pkg.-Reading 
Comprehensive Achievement Monitoring        National Evaluation Systems 
  (CAM) Maintenance Pkg.-Mathematics 
Objectives-Based Test Sets-Reading          Instructional Objectives Exchange 
Objectives-Based Test Sets-Mathematics      Instructional Objectives Exchange 
Reading-Analysis of Skills                  Scholastic Testing Service 
Mathematics-Analysis of Skills              Scholastic Testing Service 
Tests of Achievement in Basic Skills        Educational and Industrial Testing Service 
  (TABS)-Reading 
Tests of Achievement in Basic Skills        Educational and Industrial Testing Service 
  (TABS)-Mathematics 
Reading Inventory Probe I                   American Testing Company 
Mathematics Inventory Tests                 American Testing Company 



